In case you weren’t around on Monday

Why generalised linear models?

  • Unified framework of advanced statistical techniques that covers a wider range of data-analysis problems.
    • Linear regression
    • General linear models: certain problems can be easily accommodated (e.g. varying slope models aka ANCOVA).
    • Generalised linear models: transformations applied to linear models allows modelling of data-types including loglinear models, categorical, ordinal, discrete frequency data.
    • Multi-level generalised linear models: data with inherently hierarchical structure (repeated measures.
  • Backbone of statistical modelling: machine learning, time series models, path analysis, structural equation models, factor analysis, signal detection models.
  • These models arise as special cases of a single underlying general principle.
  • Apply this general principle to a wider range of data analysis problems.

Normal linear model (formal description)

  • Model of how the distribution of one variable, known as the outcome variable, varies as a function of other variables, known as the explanatory or predictor variables.

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu, \sigma^2) \end{aligned} \]

  • \(y\): observed outcome variable
  • Each observation \(y_i\) is a sample (indexed by \(i\))
  • Lower case letters indicate observed variables.
  • “\(\sim\)”: “is proportional to” / “is a function of”
  • \(\mathcal{N}\): univariate normal distributed probability model; i.e. the underlying process that describes the nature of our data.
  • Population parameters are represented as Greek letters; i.e. mean \(\mu\) and variance \(\sigma^2\) (parameters depend on probability model and parameterisation).

Normal linear model (formal description)

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_1 \cdot x_{1i} \\ \end{aligned} \]

Value of \(\mu_i\) can be decomposed using a linear regression equation.

  • \(\beta_1\) directly depends on explanatory variable \(x_{1i}\)
  • \(\beta_0\) is “constant”
  • If \(x_{1i}\) takes on 0, \(\mu_i = \beta_0\)
  • If \(x_{1i}\) takes on 1, \(\mu_i = \beta_0 + \beta_1\)
  • \(\beta_1\) is the difference between \(x_{1i}=1\) and \(x_{1i}=0\).
  • Value of \(\mu_i\) is deterministic (“\(=\)”) linear function of values of one or more predictor variables \(x\) and the unknowns \(\beta_0\) (intercept) and \(\beta_1\) (slope).

Simulation of normal distributed data

  • We can use lm to easily implement such a model.
model <- lm(y ~ x, data = data)
  • We can simulate data that exactly meet model assumptions (normal distribution, equal variance, independence etc.) with known model parameters.

Simulation of normal distributed data

  • rnorm generates unimodal, symmetrically distributed values.
  • r in rnorm = random; norm = normal
# Generate data
y <- rnorm(n = 1000, mean = 550, sd = 10)

Simulation of normal distributed data

  • rnorm generates unimodal, symmetrically distributed values.
  • r in rnorm = random; norm = normal
# Decompose the mean
beta_0 = 500
beta_1 = 50
# Generate data
y <- rnorm(n = 1000, mean = beta_0 + beta_1, sd = 10)

Simulation of normal distributed data

# Set parameter values
n <- 1000
beta_0 <- 500 
beta_1 <- 50
sigma <- 100
# Random data for group 1
group_1 <- rnorm(n = n/2, mean = beta_0, sd = sigma)
# Random data for group 2
group_2 <- rnorm(n = n/2, mean = beta_0 + beta_1, sd = sigma)
# Generate data
sim_data <- tibble(group_1 = group_1,
                   group_2 = group_2) 

# Preview data
glimpse(sim_data)
Rows: 500
Columns: 2
$ group_1 <dbl> 493, 510, 500, 483, 447, 276, 484, 377, 597, 479, 332, 629, 61…
$ group_2 <dbl> 569, 528, 654, 610, 367, 356, 617, 295, 633, 454, 570, 610, 58…
# Change data format
sim_data <- pivot_longer(sim_data, cols = c(group_1, group_2), names_to = "x", values_to = "y")

# Preview data
glimpse(sim_data)
Rows: 1,000
Columns: 2
$ x <chr> "group_1", "group_2", "group_1", "group_2", "group_1", "group_2", "g…
$ y <dbl> 493, 569, 510, 528, 500, 654, 483, 610, 447, 367, 276, 356, 484, 617…

Simulation of normal distributed data

# As a reminder, these are the parameter values
beta_0 <- 500 
beta_1 <- 50
sigma <- 100

lm uses maximum likelihood estimation to determine for which set of parameter values are the data most likely.

# Normal linear model of simulated data
model <- lm(y ~ x, data = sim_data)

coef(model) # Model coefficients
(Intercept)    xgroup_2 
      502.2        58.6 
sigma(model) # Standard deviation
[1] 97.9

Exercise

Complete script linear-models/exercises/normal_model_simulation.R

Observe how changing the number of observations affects the model estimates.

Real data

  • In real life we don’t know the true parameter values.
  • We also don’t know the true process that generates the data.
  • For the simulated data we know the process that generated the data.
    • normal distribution because rnorm samples normally distributed data
    • both groups have the same variance because sigma was the same.
  • We can use linear models to estimate parameter values from the data by which we make assumptions about the process that generates the data.

Running example: Martin et al. (2010)

Martin et al. (2010) data (Experiment 3a)

data <- read_csv("../data/martin-etal-2010-exp3a.csv") %>% glimpse()
Rows: 1,036
Columns: 4
$ item   <dbl> 11, 11, 11, 11, 11, 11, 25, 25, 25, 25, 25, 37, 37, 37, 37, 5, …
$ ppt    <dbl> 10, 9, 11, 6, 7, 2, 10, 9, 11, 7, 2, 10, 9, 6, 2, 10, 9, 11, 6,…
$ rt     <dbl> 1055, 2010, 461, 977, 1152, 1079, 1194, 1267, 1160, 684, 1367, …
$ nptype <chr> "conjoined", "conjoined", "conjoined", "conjoined", "conjoined"…
summarise(data, across(c(ppt, item), list(n = ~length(unique(.))), .names = "{col}"))
# A tibble: 1 × 2
    ppt  item
  <int> <int>
1    12    48

Aggregation by participants (across items)

In standard repeated measures ANOVAs only one source of variance can be modelled at a time.

For example one would aggregate across items and by-participant neglecting the by-items variance.

data_ppt <- data %>% summarise(rt = mean(rt), .by = c(nptype, ppt)) 
data_ppt
# A tibble: 24 × 3
  nptype      ppt    rt
  <chr>     <dbl> <dbl>
1 conjoined    10  944.
2 conjoined     9 1298.
3 conjoined    11 1410.
4 conjoined     6 1000.
5 conjoined     7  922.
# ℹ 19 more rows

Participant variation (by-participant means)

Model description

Each rt observation \(i \in 1 \dots N\) can be said to come a normal distribution with a mean \(\mu\) and a standard deviation \(\sigma^2\)

\[ \text{rt}_i \sim \mathcal{N}(\mu_i, \sigma^2)\\ \]

Using the linear regression equation, the mean \(\mu\) can be decomposed into

\[ \mu_i = \beta_0 + \beta_1 \cdot \text{nptype}_i + \text{ppt}_i\\ \]

where \(\text{nptype}_i\) is dummy coded (by default as treatment contrast in alphabetical order)

\[ \text{nptype}_i = \left\{ \begin{array}{ll} 0, \text{ if nptype}_i = \texttt{conjoined}\\ 1, \text{ if nptype}_i = \texttt{simple}\\ \end{array} \right. \]

\(\beta_0\) is the average rt for conjoined NPs because

\(\beta_0 + \beta_1 \cdot 0 = \beta_0\),

\(\beta_1\) is the difference in rts between conjoined and simple NPs. Therefore, the average rt for simple NPs is

\(\beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1\),

and finally \(\text{ppt}_i \sim \mathcal{N}(0, \sigma^2_\text{ppt})\)

Model implementation in R

Repeated measures ANOVA:

library(afex)
m <- aov_car(rt ~ Error(ppt/nptype), data = data_ppt)
summary(m)
Univariate Type III Repeated-Measures ANOVA Assuming Sphericity

              Sum Sq num Df Error SS den Df F value  Pr(>F)    
(Intercept) 28783750      1   587304     11  539.11 1.1e-10 ***
nptype          6217      1    13043     11    5.24   0.043 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear mixed-effects model@

library(lme4)
m <- lmer(rt ~ nptype + (1|ppt), data = data_ppt)
anova(m)
Type III Analysis of Variance Table with Satterthwaite's method
       Sum Sq Mean Sq NumDF DenDF F value Pr(>F)  
nptype   6217    6217     1    11    5.24  0.043 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise on repeated measures ANOVA

Complete exercise script

linear-models/exercises/repeated_measures_model.R

and after that

linear-models/exercises/mixed_effects_model.R

Item variation (by-image sets means)

What’s a mixed effects model?

  • Models with a combination of fixed (intercepts and slopes) and random effects.
  • Repeated measures ANOVA are a special type of mixed-effects models.
  • Linear models is not as contrained as ANOVAs.
  • Fixed effects: systematic / deterministic effects (e.g. age, trial id, grouping variables)
  • Random effects: non-systematic effects (e.g. ppts, items, class, country)
  • Random effects can be nested: people within country
  • Sometimes the differences can be blurry: comparing differences between two test sites rather than treating test sites as repeated measures variable.

Crossed random intercepts for participants and items

  • Repeated-measures ANOVA can only handle either participant or item variance
  • Linear mixed-effects models can capture both (Baayen, Davidson, and Bates 2008)
m <- lmer(rt ~ nptype + (1|ppt) + (1|item), data = data)
summary(m)$coef
             Estimate Std. Error    df t value Pr(>|t|)
(Intercept)    1111.8       48.0  11.9    23.2 3.11e-11
nptypesimple    -33.4       14.5 977.9    -2.3 2.16e-02

How to get null-hypothesis for lmer models?

  • Likelihood ratio test
m_null <- lmer(rt ~ 1 + ( 1 | ppt ) + ( 1 | item ), data = data)
anova(m_null, m)
Data: data
Models:
m_null: rt ~ 1 + (1 | ppt) + (1 | item)
m: rt ~ nptype + (1 | ppt) + (1 | item)
       npar   AIC   BIC logLik deviance Chisq Df Pr(>Chisq)  
m_null    4 14319 14339  -7155    14311                      
m         5 14316 14340  -7153    14306  5.28  1      0.022 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Satterthwaite approximation for degrees of freedom (lmerTest)
m <- lmerTest::lmer(rt ~ nptype + (1|ppt) + (1|item), data = data)
summary(m)$coef
             Estimate Std. Error    df t value Pr(>|t|)
(Intercept)    1111.8       48.0  11.9    23.2 3.11e-11
nptypesimple    -33.4       14.5 977.9    -2.3 2.16e-02
  • t-values larger than |2| roughly correspond to a significance level of \(\alpha = .05\) (Baayen 2008)
  • Confidence intervals: 95% CI shows the range of hypotheses that can be ruled out for \(\alpha = .05\)
confint(m, "nptypesimple")
             2.5 % 97.5 %
nptypesimple -61.8  -4.93
  • But who needs p-values anyway :)

Crossed random intercepts for participants and items

summary(m)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: rt ~ nptype + (1 | ppt) + (1 | item)
   Data: data

REML criterion at convergence: 14289

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.518 -0.596 -0.140  0.448  4.965 

Random effects:
 Groups   Name        Variance Std.Dev.
 item     (Intercept)  1649     40.6   
 ppt      (Intercept) 25996    161.2   
 Residual             54536    233.5   
Number of obs: 1036, groups:  item, 48; ppt, 12

Fixed effects:
             Estimate Std. Error     df t value Pr(>|t|)    
(Intercept)    1111.8       48.0   11.9    23.1  3.1e-11 ***
nptypesimple    -33.4       14.5  977.9    -2.3    0.022 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
nptypesimpl -0.151

Random intercepts (variances)

Random intercepts (variances)

Predicted outcomes

summary(m)$coef
             Estimate Std. Error    df t value Pr(>|t|)
(Intercept)    1111.8       48.0  11.9    23.2 3.11e-11
nptypesimple    -33.4       14.5 977.9    -2.3 2.16e-02
(coefs <- m@beta) # extract estimates
[1] 1111.8  -33.4
# Predicted mean for np conjoined
coefs[1] + coefs[2] * 0
[1] 1112
# Predicted mean for np simple
coefs[1] + coefs[2] * 1
[1] 1078
# Check predictor contrasts (default: treatment)
contrasts(factor(data$nptype))
          simple
conjoined      0
simple         1

What does the model predict for participants and item?

data_pred <- expand.grid(item = 1:48, ppt = 1:12, nptype = unique(data$nptype)) %>% 
  mutate(pred = predict(m, newdata = .),
         m = "random intercepts")

Random intercepts and random slopes models

  • Random slopes: NP-type effect is differently strong for certain participants and items.
  • Correlation between random intercepts and slopes: E.g. NP-type effect might be weaker for participants / items with longer rts.
  • Maximal random effects structure (Barr et al. 2013): for repeated measures within participants design we need random intercepts for participants and items with by-participants and by-items random slopes for NP type.
select(data, item, nptype) %>% unique() %>% arrange(item, nptype)
# A tibble: 96 × 2
   item nptype   
  <dbl> <chr>    
1     1 conjoined
2     1 simple   
3     2 conjoined
4     2 simple   
5     3 conjoined
# ℹ 91 more rows
select(data, ppt, nptype) %>% unique() %>% arrange(ppt, nptype)
# A tibble: 24 × 2
    ppt nptype   
  <dbl> <chr>    
1     1 conjoined
2     1 simple   
3     2 conjoined
4     2 simple   
5     3 conjoined
# ℹ 19 more rows

Random intercepts and random slopes models

# Maximal random effects structure (with correlation)
m_max <- lmer(rt ~ nptype + (nptype|ppt) + (nptype|item), data = data)

# Maximal random effects structure (without correlation)
m_max_nocor <- lmer(rt ~ nptype + (nptype||ppt) + (nptype||item), data = data)

Random intercepts and random slopes models

bind_rows(data_pred, data_pred_cor, data_pred_nocor) %>% 
  filter(ppt %in% c(7, 11)) %>% 
  mutate(across(m, ~case_when(str_detect(., "intercepts") ~ "\n(1|ppt) + (1|item)",
                              str_detect(., "without") ~ "\n(nptype||ppt) + (nptype||ppt)",
                              str_detect(., "with ") ~ "\n(nptype|ppt) + (nptype|ppt)"))) %>% 
  rename(`random effects` = m) %>% 
  ggplot(aes(x = nptype, y = pred, group = item)) +
  geom_point(size = .1) +
  geom_line(linewidth = .1) +
  facet_grid(ppt ~ `random effects`, scales = "free", labeller = label_both) +
  labs(y = "predicted rt", caption = "Lines represent different items")

Model coefficients of maximal random-effects model

# Maximal random effects structure (with correlation)
m_max <- lmer(rt ~ nptype + (nptype|ppt) + (nptype|item), data = data)
summary(m_max)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: rt ~ nptype + (nptype | ppt) + (nptype | item)
   Data: data

REML criterion at convergence: 14288

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.505 -0.596 -0.147  0.449  5.002 

Random effects:
 Groups   Name         Variance Std.Dev. Corr 
 item     (Intercept)   2749.8   52.44        
          nptypesimple   522.1   22.85   -1.00
 ppt      (Intercept)  27159.7  164.80        
          nptypesimple    52.4    7.24   -1.00
 Residual              54359.2  233.15        
Number of obs: 1036, groups:  item, 48; ppt, 12

Fixed effects:
             Estimate Std. Error     df t value Pr(>|t|)    
(Intercept)    1111.9       49.3   11.5   22.58    7e-11 ***
nptypesimple    -33.6       15.0  125.7   -2.24    0.027 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
nptypesimpl -0.310
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')

Hypothetical examples for maximal random effects structure

Task: pair designs with their corresponding maximal random-effects structure!

  • Design 1 (between participants / within items): both NP type conditions were presented for each item set but half of the participants saw simple NPs and the other half saw conjoined NPs.
  • Design 2 (within participants / within items): each participant saw both conditions and each item was presented in both conditions.
  • Design 3 (within participants / between items): half of all items were conjoined and the other half were simple NP type; each participant saw both condition (simple NPs and conjoined NPs).
  • Design 4 (between participants / between items): half of all items were conjoined and half were simple NP type; half of the participants saw simple NPs and the other half saw conjoined NPs.
Model_A <- lmer(rt ~ nptype + (1|ppt) + (nptype|item), data = data) 
Model_B <- lmer(rt ~ nptype + (1|ppt) + (1|item), data = data)
Model_C <- lmer(rt ~ nptype + (nptype|ppt) + (nptype|item), data = data) 
Model_D <- lmer(rt ~ nptype + (nptype|ppt) + (1|item), data = data) 

Exercise

Complete exercise script linear-models/exercises/mixed_effects_model_rirs.R

Generalised linear models

Type of outcome variable

  • Continuous: lme4::lmer(dv ~ iv + (random slopes | random intercepts), data = data)
  • Binary: lme4::glmer(dv ~ iv + (random slopes | random intercepts), data = data, family = binomial())
  • Count: lme4::glmer(dv ~ iv + (random slopes | random intercepts), data = data, family = poisson())
    • Also negative-binomial lme4::glmer.nb and zero-inflated count models NBZIMM::glmm.zinb
  • Ordinal: ordinal::clmm(dv ~ iv + (random slopes | random intercepts), data = data)

Frequentist mixed-effects models require familiarity with different packages.

Convergence failures

  • Maximal random effects structure were introduced as the gold-standard for repeated-measure experiments (Barr et al. 2013; Baayen, Davidson, and Bates 2008).
  • These are typically easy to fit for simple designs with balanced samples (few missing data) or indeed simulations (Barr et al. 2013).
  • Convergence failures for designs as simple 2$$2 repeated-measures designs are well documented in the literature (Bates et al. 2015; Eager and Roy 2017; Kimball et al. 2016).
  • Overparametrisation: unidentifiable parameter estimates (Bates et al. 2015)
  • lme4 author Bates proposed ways to deal with convergence failures by using “parsimonious” random effects structures (Bates et al. 2015)
  • Model selection based on failure to fit a more complex one; some sources of random errors will not be addressed.
  • Alternatively use Bayesian models: Bayesian model converge by definition (as the number of iterations approaches \(\infty\)).

Top tip:

  • Convergence failures (and other problems like floor and ceiling effects) can be (easily) addressed in Bayesian models.
  • There is no other R package that allows a straightforward implementation of as many probability models as brms (Bürkner 2017) using a syntax that roughly mimics lme4
  • More flexibility is on possible when using Stan directly (Annis, Miller, and Palmeri 2017; Carpenter et al. 2017).

Fitting a model in brms

(Bürkner 2017, 2018)

brms

  • R package for Bayesian models
  • (Almost) no more complicated to fit than lme4s.
  • More probability models than other regressions packages:
    • gaussian, lognormal, bernoulli, poisson, zero_inflated_poisson, skew_normal, shifted_lognormal, exgaussian, et cetera
    • and allows mixture models, nonlinear syntax, (un)equal variance signal-detection theory, multivariate models


  • Under the hood: brms creates Stan code to compile a probabilistic MCMC (Markov Chain Monte Carlo) sampler.
  • Compiling the sampler and for the sampler to obtain the full posterior can take times.

Exercise

Complete script linear-models/exercises/mixed_effects_model_brms.R

R syntax

Frequentist mixed-effects models

library(lme4)
fit_lmer <- lmer(rt ~ nptype + (nptype|ppt) + (nptype|item), data = data)

Bayesian mixed-effects models

library(brms)
fit_brm <- brm(rt ~ nptype + (nptype|ppt) + (nptype|item), data = data)

Fit models on simulated data

coef(summary(fit_lmer)) # Coefficients of frequentist model
             Estimate Std. Error    df t value Pr(>|t|)
(Intercept)    1111.9       49.3  11.5   22.58 6.96e-11
nptypesimple    -33.6       15.0 125.7   -2.24 2.70e-02
fixef(fit_brm) # Coefficients of Bayesian model
             Estimate Est.Error   Q2.5    Q97.5
Intercept      1108.0      50.2 1010.1 1206.889
nptypesimple    -33.2      17.2  -66.3    0.749

Why Bayesian?

see e.g. Kruschke (2014)

Two universes

  • Two different ways of generalising from sample to the population:
  • What do the data tell us about the population?
  • Null-hypothesis significance testing
  • Bayesian inference

Null-hypothesis significance testing

  • Classical / frequentist statistics; p-value based statistics
  • p-value: How plausible are the data (or something more extreme) if we assume that there’s no effect?
  • Evaluating the probability of data (e.g. t-value, F-statistic) assuming that the null hypothesis \(H_0\) is true.
  • If the data are extreme enough, \(H_0\) is concluded to be implausible (rejected).
  • If \(H_0\) is implausible, we assume \(H_1\) (alternative hypothesis).
  • Statistically significant means data are unexpected under \(H_0\).

\[ Pr(\text{data} \mid H_0 = \text{TRUE}) \]

Its problems

  • Statistical significance is not binary but continuous.
  • How often are we actually interested in or seriously believe in \(H_0\)?
  • What are you doing if \(p=0.051\)?
  • Stopping rule
  • \(\alpha\)-level of 0.05 implies that there is a 5% chance that:
    • our effect doesn’t replicate (isn’t real)
    • we will not observe an effect that is real
  • This seems even worse in reality (Amrhein, Trafimow, and Greenland 2019; Loken and Gelman 2017)


  • Trade-off of statistical significance, effect size, and sample size:
    • Smaller effects can become significant when the sample is large enough.
    • But how plausible is a small effect (e.g. mean difference, correlation) for what we are interested in?
    • What really matters is that the size of the effect is what we would expect it to be.
  • Systematic misinterpretations of p-values (Colquhoun 2017; Greenland et al. 2016) and CIs (Hoekstra et al. 2014; Morey et al. 2016).

Bayesian inference

  • Expressing uncertainty about a hypothesis expressed in probability distributions
  • Updating one’s belief (prior, example later) about a hypothesis (e.g. difference between conditions, interactions etc) as the new information becomes available.

\[ Pr(H \mid \text{data}) \]

Terminology: Bayes’ Theorem

\[ \underbrace{Pr(\theta \mid \text{data})}_{\text{posterior}} \propto \underbrace{Pr(\theta)}_{\text{prior}} \cdot \underbrace{Pr(\text{data} \mid \theta)}_{\text{likelihood}} \] - Likelihood: probability of data given the model parameter value(s) - Prior: what did we already know about model parameter(s) - Posterior: our updated belief in the parameter value(s) after seeing the data

Why is Bayesian inference important?

It tells us what we want to know!


  • Summary stats of a Bayesian model are often very similar to what people think frequentist quantities (p-values, CIs) mean (Nicenboim and Vasishth 2016):
  • What’s the relative evidence for one model (e.g., \(H_0\)) vs. another (e.g., \(H_1\))?
  • What interval contains an unknown parameter value with .95 probability?

Why is Bayesian inference important?

Instead of concluding


\[ p(\text{data} \mid H_0 = \text{TRUE}) \]

we can answer


\[ p(H \mid \text{data}) \]

Why is Bayesian inference important?

  • What’s the probability of an effect given the data (and what we already knew before)?
  • Useful for small data sets for which prior information plays a huge role.
  • Probabilistic sampling can handle convergence failures, missing data problems, complex models (many parameters, sources of random error), ceiling / floor effects, identifiability issues.
  • Results are not dependent on sampling plan (Kruschke 2018).
  • Flexible data modeling is now easily available (brms, rstanarm, rethinking).

Why is Bayesian inference important?

  • Estimate uncertainty over parameter values based on the observed data (Kruschke 2014).
  • For this we need to start with
  1. a probability model (e.g. a normal distribution and a set of predictors)
  2. a prior (on e.g. means, difference, variability)
  • brms uses probabilistic sampling to determine the posterior uncertainty about the parameter value.
  • This involve calculating the weighted probability of the data in a potentially high dimensional parameter space (which can take time).

Convergence and model diagnostics

see Lambert (2018) for HMC

So what about this?

Progress of probabilistic sampler.

By default: - 2,000 iterations for parameter estimation per chain - iterations / 2 warm-up samples: discarded eventually - 4 chains to establish convergence - Change cores (check parallel::detectCores()) for running chains in parallel. - How many posterior samples have we got for inference (= chains * (iterations - warm-up))? - How many iterations / chains do you need?

Parameter estimation

  • Hamiltonian Monte Carlo: Markov chain Monte Carlo method for obtaining random samples.
  • Probability of random samples is estimated given the data and the prior.
  • Direction changes if probability of proposed estimate is lower than previous one.

Parameter estimation

  • Hamiltonian Monte Carlo: Markov chain Monte Carlo method for obtaining random samples.
  • Probability of random samples is estimated given the data and the prior.
  • Direction changes if probability of proposed estimate is lower than previous one.
sim <- fit_brm$fit@sim
samples <- map(1:4, ~sim$samples[[.]] %>%
                 as_tibble() %>%
      select(starts_with("b")) %>%
      mutate(chain = .x,
             iteration = 1:n())) %>% bind_rows() %>%
    pivot_longer(starts_with("b")) %>%
  mutate(name = factor(name, levels = unique(name)[c(1,2)], ordered = T),
         chain = factor(chain))
pivot_wider(samples, names_from = name, values_from = value) %>%
  filter(chain == 1) %>%
  mutate(chain = paste0("Chain: ", chain)) %>%
  ggplot(aes(x = b_Intercept, y = b_nptypesimple, colour = chain)) +
  scale_color_viridis_d("") +
  geom_path(show.legend = F, colour =  "white") +
  facet_wrap(~chain) +
  labs(title = "Parameter space") +
  scale_x_continuous(limits = c(-100, 1300), breaks = seq(-150, 1600, 250)) +
  scale_y_continuous(limits = c(-150, 100), breaks = seq(-150, 600, 50)) +
  theme(legend.justification = "top")

Parameter estimation

  • Hamiltonian Monte Carlo: Markov chain Monte Carlo method for obtaining random samples.
  • Probability of random samples is estimated given the data and the prior.
  • Direction changes if probability of proposed estimate is lower than previous one.
pivot_wider(samples, names_from = name, values_from = value) %>%
  filter(chain == 1) %>%
  mutate(chain = paste0("Chain: ", chain)) %>%
  ggplot(aes(x = b_Intercept, y = b_nptypesimple, colour = chain, alpha = iteration, label = iteration)) +
  scale_color_viridis_d("") +
  geom_path(show.legend = F) +
  geom_text(show.legend = F, size = 2, aes(alpha = -iteration)) +
  facet_wrap(~chain) +
  labs(title = "Parameter space") +
  scale_x_continuous(limits = c(-100, 1300), breaks = seq(-150, 1600, 250)) +
  scale_y_continuous(limits = c(-150, 100), breaks = seq(-150, 600, 50)) +
  theme(legend.justification = "top")

Parameter estimation

samples %>%
  filter(chain %in% 1, iteration %in% 1:100) %>%
  ggplot(aes(y=value, x = iteration, colour = chain)) +
  scale_color_viridis_d("Chains") +
  geom_path() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Traceplot")

Parameter estimation

samples %>%
  filter(chain %in% 1:2, iteration %in% 1:100) %>%
  ggplot(aes(y=value, x = iteration, colour = chain)) +
  scale_color_viridis_d("Chains") +
  geom_path() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Traceplot")

Parameter estimation

samples %>%
  filter(chain %in% 1:4, iteration %in% 1:100) %>%
  ggplot(aes(y=value, x = iteration, colour = chain)) +
  scale_color_viridis_d("Chains") +
  geom_path() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Traceplot")

Parameter estimation

samples %>%
  filter(chain %in% 1:4, iteration %in% 1:250) %>%
  ggplot(aes(y=value, x = iteration, colour = chain)) +
  scale_color_viridis_d("Chains") +
  geom_path() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Traceplot")

Parameter estimation

samples %>%
  filter(chain %in% 1:4, iteration %in% 1:500) %>%
  ggplot(aes(y=value, x = iteration, colour = chain)) +
  scale_color_viridis_d("Chains") +
  geom_path() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Traceplot")

Parameter estimation

samples %>%
  filter(chain %in% 1:4, iteration %in% 1:2000) %>%
  ggplot(aes(y=value, x = iteration, colour = chain)) +
  scale_color_viridis_d("Chains") +
  geom_path() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Traceplot")

Parameter estimation

samples %>%
  filter(chain %in% 1:4, iteration %in% 1000:2000) %>%
  ggplot(aes(y=value, x = iteration, colour = chain)) +
  scale_color_viridis_d("Chains") +
  geom_path() +
  facet_wrap(~name, scales = "free") +
  labs(title = "Traceplot")

Exercise: model diagnostics and convergence checks

  • Traceplots: we want fat hairy caterpillars
  • \(\hat{R}\) convergence statistic; should be \(<1.1\) (Gelman and Rubin 1992)
  • Posterior predictive checks: compare observed data \(y\) and model predictions \(y_{rep}\)
  • Complete script linear-models/exercises/model_diagnostics.R

Where are the priors?

chapter 7 in Lee and Wagenmakers (2014)

Priors

  • Prior knowledge about plausible parameter values.
  • This knowledge is expressed as probability distributions (e.g. normal distributions).
  • Help probabilistic sampler by limiting the parameter space.
  • Small data samples are sensitive to prior information which makes intuitively sense.
  • Otherwise data typically overcome the prior (automatic Ockham’s razor).
  • Less common: test data against a (prior) effect suggested by the literature.

Priors: intercept

  • A priori, each value in the parameter space is equally possible.
  • Let’s think about the parameter space for rts.
tibble(intercept = runif(1000, 0, 5000),
       slope = runif(1000,  -10000, 10000)) %>%
  ggplot(aes(x=intercept, y=slope)) +
  labs(subtitle = "Parameter space")

Priors: intercept

  • Can rts range between -\(\infty\) and \(\infty\)?
  • What are plausible lower and upper end?
  • Plausible probability distribution for pre-sentence pauses:

\[ \beta_0 \sim \mathcal{N}(1000 \text{ msecs}, ???) \]

tibble(intercept = runif(1000, 0, 5000),
       slope = runif(1000,  -10000, 10000)) %>%
  ggplot(aes(x=intercept, y=slope)) +
  labs(subtitle = "Parameter space")

Priors: intercept

  • Can rts range between -\(\infty\) and \(\infty\)?
  • What are plausible lower and upper end?
  • Plausible probability distribution for pre-sentence pauses:

\[ \beta_0 \sim \mathcal{N}(1000 \text{ msecs}, 150\text{ msecs}) \]

p <- tibble(intercept = rnorm(10000, mean = 1000, sd = 150),
       slope = runif(10000,  -10000, 10000)) %>%
  ggplot(aes(x=intercept, y=slope)) +
    geom_point(size = .25, colour = "transparent") + 
  geom_density_2d_filled(alpha = 0.25, show.legend = F) +
  geom_density_2d(alpha = 0.5, show.legend = F, size = .15, color = "black") +
  scale_fill_viridis_d(direction = -1, begin = 0, end = .6) +
  scale_x_continuous(limits = c(0, 5000)) +
  labs(subtitle = "Parameter space")

ggExtra::ggMarginal(p, type="density", size=10, alpha = .25, margins = "x")

Priors: slope

  • What do we know about the difference between simple and conjoined NPs?
  • Let’s pretend we don’t know much: so a priority there might be a mean difference of 0 msecs.
  • How large are rt differences usually? Anything larger than 200 msecs would be massive, so lets propose a humble standard deviation of 100 msecs

\[ \beta_1 \sim \mathcal{N}(0\text{ msecs}, 510\text{ msecs}) \]

p <- tibble(intercept = rnorm(10000, mean = 1000, sd = 250),
       slope = runif(10000,  -10000, 10000)) %>%
  ggplot(aes(x=intercept, y=slope)) +
    geom_point(size = .25, colour = "transparent") + 
  geom_density_2d_filled(alpha = 0.25, show.legend = F) +
  geom_density_2d(alpha = 0.5, show.legend = F, size = .15, color = "black") +
  scale_fill_viridis_d(direction = -1, begin = 0, end = .6) +
  scale_x_continuous(limits = c(0, 10000)) +
  labs(subtitle = "Parameter space")

ggExtra::ggMarginal(p, type="density", size=10, alpha = .25, margins = "x")

Priors: slope

  • What do we know about the difference between simple and conjoined NPs?
  • Let’s pretend we don’t know much: so a priority there might be a mean difference of 0 msecs.
  • How large are rt differences usually? Anything larger than 200 msecs would be massive, so lets propose a humble standard deviation of 100 msecs

\[ \beta_1 \sim \mathcal{N}(0\text{ msecs}, 100\text{ msecs}) \]

p <- tibble(intercept = rnorm(10000, mean = 1000, sd = 500),
       slope = rnorm(10000,  0, 100)) %>%
  ggplot(aes(x=intercept, y=slope)) +
    geom_point(size = .25, colour = "transparent") + 
  geom_density_2d_filled(alpha = 0.25, show.legend = F) +
  geom_density_2d(alpha = 0.5, show.legend = F, size = .15, color = "black") +
  scale_fill_viridis_d(direction = -1, begin = 0, end = .6) +
  scale_x_continuous(limits = c(0, 5000)) +
  scale_y_continuous(limits = c(-300,1000), breaks = seq(-300, 900, 250)) +
  labs(subtitle = "Parameter space")
ggExtra::ggMarginal(p, type="density", size=10, alpha = .25, margins = "both")

Priors

  • brms uses default priors that can be changed as appropriate.
  • Some defaults are good, (flat) priors are worth specifying.

Check defaults used earlier:


prior_summary(fit_brm)
prior class coef group
(flat) b
(flat) b nptypesimple
lkj(1) cor
lkj(1) cor item
lkj(1) cor ppt
student_t(3, 1039.5, 235) Intercept
student_t(3, 0, 235) sd
student_t(3, 0, 235) sd item
student_t(3, 0, 235) sd Intercept item
student_t(3, 0, 235) sd nptypesimple item
student_t(3, 0, 235) sd ppt
student_t(3, 0, 235) sd Intercept ppt
student_t(3, 0, 235) sd nptypesimple ppt
student_t(3, 0, 235) sigma

Priors

p <- tibble(intercept = rstudent_t(10000, 3, 1039.5, 235),
       slope = runif(10000, -2000, 2000)) %>%
  ggplot(aes(x=intercept, y=slope)) +
    geom_point(size = .25, colour = "transparent") + 
  geom_density_2d_filled(alpha = 0.25, show.legend = F) +
  geom_density_2d(alpha = 0.5, show.legend = F, size = .15, color = "black") +
  scale_fill_viridis_d(direction = -1, begin = 0, end = .6) +
  scale_x_continuous(limits = c(0, 5000)) +
  scale_y_continuous(limits = c(-1000, 1000), breaks = seq(-1000, 1000, 250)) +
  labs(subtitle = "Parameter space")
ggExtra::ggMarginal(p, type="density", size=10, alpha = .25, margins = "both")

Exercise

You will refit our model with more informative priors in the next script.

  • Complete script linear-models/exercises/mixed_effects_model_brms_with_priors.R

Hypothesis testing

see e.g. Nicenboim and Vasishth (2016) for a tutorial

What results do I report?

  • Summary of posterior parameter value estimates
  • To test a hypothesis of parameter estimate (e.g. difference between groups, interactions, change in outcome):
    • ROPE
    • Savage-Dickey Bayes Factor: for comparison of nested models
  • For comparison of models with different probability functions: Cross-validation (leave-one-out)

Posterior probability distribution

  • Get posterior of slope \(\beta\) (difference between conditions)
beta <- as_draws_df(fit_brm) %>% pull(b_nptypesimple)
length(beta)
[1] 4000
beta[1:5]
[1] -34.0 -45.2 -33.0 -37.6 -23.7

Posterior probability distribution

ggplot(data = NULL, aes(x = beta)) +
  geom_histogram(alpha = .55) +
  labs(x = bquote("Estimated effect"~hat(beta)))

Posterior probability distribution

  • Posterior mean
mean(beta)
[1] -33.2
ggplot(data = NULL, aes(x = beta)) +
  geom_histogram(alpha = .55) +
  labs(x = bquote("Estimated effect"~hat(beta))) +
  geom_vline(xintercept = mean(beta), color = "darkred")

Posterior probability distribution

  • 95% probability interval (conceptually different from confidence interval)
quantile(beta, probs = c(0.025, 0.975))
   2.5%   97.5% 
-66.281   0.749 
ggplot(data = NULL, aes(x = beta)) +
  geom_histogram(alpha = .55, bins = 50, position = "identity", show.legend = F) +
  labs(x = bquote("Estimated effect"~hat(beta))) +
  geom_errorbarh(aes(y = 50, xmin = quantile(beta, probs = c(0.025))
, xmax =  quantile(beta, probs = c(0.975))), color = "darkred", size = 1.5, ) +
  annotate("text", x = 15, y = 55, label = "95% PI", colour = "darkred")

Posterior probability distribution

  • 95% probability interval (conceptually different from confidence interval)
quantile(beta, probs = c(0.025, 0.975))
   2.5%   97.5% 
-66.281   0.749 
  • 89% probability interval (McElreath 2020)
quantile(beta, probs = c(0.055, 0.945))
  5.5%  94.5% 
-60.83  -6.53 
ggplot(data = NULL, aes(x = beta)) +
  geom_histogram(alpha = .55, bins = 50, position = "identity", show.legend = F) +
  labs(x = bquote("Estimated effect"~hat(beta))) +
  geom_errorbarh(aes(y = 75, xmin = quantile(beta, probs = c(0.055))
, xmax =  quantile(beta, probs = c(0.945))), color = "darkred", size = 1.5, ) +
  annotate("text", x = 15, y = 90, label = "89% PI", colour = "darkred")

Posterior probability distribution

  • Probability that slope \(\beta\) is negative: \(P(\hat{\beta}<0)\)
mean(beta < 0)
[1] 0.973
  • See exercises: linear-models/exercises/hypothesis_testing_post.R
beta2 <- beta[beta < 0]
ggplot(data = NULL, aes(x = beta, fill = beta > 0)) +
  geom_histogram(alpha = .55, bins = 50, position = "identity", show.legend = F) +
  scale_fill_manual(values = c("darkred", "grey")) +
  geom_vline(xintercept = 0, colour = "grey20", linetype = "dotted") +
  labs(x = bquote("Estimated effect"~hat(beta))) 

Posterior probability distribution

  • Probability that slope \(\beta\) is negative: \(P(\hat{\beta}<0)\)
mean(beta < -10)
[1] 0.917

ROPE

e.g. Kruschke (2010)

ROPE

quantile(beta, probs = c(.025, .5, .975))
   2.5%     50%   97.5% 
-66.281 -32.873   0.749 
mean(beta < 0)
[1] 0.973

There is nothing really special about 0.

ggplot(data = NULL, aes(x = beta, fill = beta > -10)) +
  geom_histogram(alpha = .55, bins = 50, position = "identity", show.legend = F) +
  geom_vline(xintercept = -10, colour = "grey20", linetype = "dotted") +
  scale_fill_manual(values = c("darkred", "grey")) +
  labs(x = bquote("Estimated effect"~hat(beta))) 

ROPE

  • Point values don’t represent our beliefs about what the null hypothesis is like: \(H_0 = 0\)
  • Region of practical equivalence: range of values that are practically equivalent to a null effect(Kruschke 2010, 2011).
  • Define region that is equivalent to no effect: all values in an interval of [-5, 5] msecs
ggplot(data = NULL, aes(x = beta)) +
  geom_histogram(alpha = .55, bins = 50, position = "identity", show.legend = F) +
  labs(x = bquote("Estimated effect"~hat(beta))) 

ROPE

  • Point values don’t represent our beliefs about what the null hypothesis is like: \(H_0 = 0\)
  • Region of practical equivalence: range of values that are practically equivalent to a null effect(Kruschke 2010, 2011).
  • Define region that is equivalent to no effect: all values in an interval of [-5, 5] msecs
library(bayestestR)
rope(beta, range = c(-5, 5)) 
# Proportion of samples inside the ROPE [-5.00, 5.00]:

inside ROPE
-----------
2.18 %     
ggplot(data = NULL, aes(x = beta)) +
  geom_histogram(alpha = .55, bins = 50, position = "identity", show.legend = F) +
  labs(x = bquote("Estimated effect"~hat(beta))) 

ROPE

  • Point values don’t represent our beliefs about what the null hypothesis is like: \(H_0 = 0\)
  • Region of practical equivalence: range of values that are practically equivalent to a null effect(Kruschke 2010, 2011).
  • Define region that is equivalent to no effect: all values in an interval of [-5, 5] msecs
library(bayestestR)
rope(beta, range = c(-5, 5)) 
# Proportion of samples inside the ROPE [-5.00, 5.00]:

inside ROPE
-----------
2.18 %     
ggplot(data = NULL, aes(x = beta, fill = beta > -5 & beta < 5)) +
  geom_histogram(alpha = .55, bins = 50, position = "identity", show.legend = F) +
  geom_vline(xintercept = 5, colour = "grey20", linetype = "dotted") +
  geom_vline(xintercept = -5, colour = "grey20", linetype = "dotted") +
  scale_fill_manual(values = c("grey", "darkred")) +
  labs(x = bquote("Estimated effect"~hat(beta))) 

ROPE

  • Region of practical equivalence: range of values that are practically equivalent (Kruschke 2010, 2011).
  • Point values don’t represent our beliefs about what the null hypothesis is like: \(H_0 = 0\)
  • The ROPE is the range of values that are practically equivalent to a null effect.
  • Define region that is equivalent to no effect: all values in interval [-Inf, 5]
library(bayestestR)
rope(beta, range = c(-10, Inf))
# Proportion of samples inside the ROPE [-10.00, Inf]:

inside ROPE
-----------
6.05 %     
ggplot(data = NULL, aes(x = beta, fill = beta > -10 & beta < Inf)) +
  geom_histogram(alpha = .55, bins = 50, position = "identity", show.legend = F) +
  geom_vline(xintercept = -10, colour = "grey20", linetype = "dotted") +
  scale_fill_manual(values = c("grey", "darkred")) +
  labs(x = bquote("Estimated effect"~hat(beta))) 

ROPE

  • Determine ROPE to assess the uncertainty of effect size (Makowski, Ben-Shachar, and Lüdecke 2019)
  • Calculate the standardised effect size \(\delta\) from posterior: \(\delta = \frac{\beta}{\sigma}\)
  • ROPE of [-0.1, 0.1] is the range of negligible effect sizes (Cohn 1988; Kruschke 2018).

ROPE

  • Determine ROPE to assess the uncertainty of effect size (Makowski, Ben-Shachar, and Lüdecke 2019)
  • Calculate the standardised effect size \(\delta\) from posterior: \(\delta = \frac{\beta}{\sigma}\)
  • ROPE of [-0.1, 0.1] is the range of negligible effect sizes (Cohn 1988; Kruschke 2018).
sigma <- as_draws_df(fit_brm) %>% pull(sigma)
delta <- beta / sigma
abs(mean(delta)) # very small effect (see Cohen's d)
[1] 0.142

ROPE

  • Determine ROPE to assess the uncertainty of effect size (Makowski, Ben-Shachar, and Lüdecke 2019)
  • Calculate the standardised effect size \(\delta\) from posterior: \(\delta = \frac{\beta}{\sigma}\)
  • ROPE of [-0.1, 0.1] is the range of negligible effect sizes (Cohn 1988; Kruschke 2018).
rope(delta, range = c(-0.1, 0.1)) 
# Proportion of samples inside the ROPE [-0.10, 0.10]:

inside ROPE
-----------
26.34 %    
  • The ROPE value indicates the extent to which the posterior cannot rule out a negligible effect.
  • A meaningful effect size should have a small proportion of posterior samples within the ROPE.

Exercise

Complete script linear-models/exercises/hypothesis_testing_rope.R

Bayes Factor

e.g. chapter 7.6 in Lee and Wagenmakers (2014)

Bayes Factor

  • p-values have two possible outcomes: >1. reject the null (i.e. \(p<0.05\)) >2. inconclusive (i.e. \(p>0.05\) – go back to start)

  • BFs have a continuum of three possible outcomes (Dienes 2014, 2016; Dienes and Mclatchie 2018): >1. Reject the null / alternative hypothesis >2. Inconclusive (get more data!) >3. Accept the null / alternative hypothesis

Bayes Factor

\[ \text{BF} = \frac{p(H_1 \mid y)}{p(H_0 \mid y)} \]

  • How much more probable is one model (hypothesis) over the other?
  • How convinced should we be about the evidence for our hypothesis \(H_1\) as opposed to, say, a null hypothesis \(H_0\)?
  • Savage-Dickey density ratio for nested models (Jeffreys 1961; Dickey, Lientz, et al. 1970):
  • Height of the posterior density at zero compared to the height of the prior density at zero.

Bayes Factor: Savage-Dickey density ratio

  • Posterior: \(\mathcal{N}(-32.2, 17.4)\)
  • Prior: \(\mathcal{N}(0, 100)\)
dnorm(0, 0, 100)
[1] 0.00399
dnorm(0, mean(beta), sd(beta))
[1] 0.00361
dnorm(0, 0, 100) / dnorm(0, mean(beta), sd(beta))
[1] 1.11
pd <- tibble(x = seq(-200, 200, by = 10),
             Prior = dnorm(x, 0, 100),
             Posterior = dnorm(x, mean(beta), sd(beta))) %>%
  pivot_longer(Prior:Posterior)

data_at_zero <- filter(pd, x == 0)
pr <- filter(data_at_zero, name == "Prior") %>% pull(value)
po <- filter(data_at_zero, name == "Posterior") %>% pull(value)

ggplot(pd, aes(x = x, y = value, colour = name)) +
  geom_line() +
#  geom_point(data = data_at_zero, aes(x = x, y = value), 
#             colour = "black", size = 2) +
#  geom_label(data = data_at_zero, aes(x = x, y = value, colour = name, label = round(value, 3)), 
#             colour = "black", size = 3, position = position_dodge(.0001)) +
  scale_color_colorblind("") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  labs(y = "density", x = bquote(hat(beta))) #, subtitle = bquote("BF"==.(round(pr/po,2)))) +

Make sure you know your priors

  • Posterior: \(\mathcal{N}(-32.2, 17.4)\)
  • Prior: \(\mathcal{N}(0, 25)\)
dnorm(0, 0, 25)
[1] 0.016
dnorm(0, mean(beta), sd(beta))
[1] 0.00361
dnorm(0, 0, 25) / dnorm(0, mean(beta), sd(beta))
[1] 4.42
pd <- tibble(x = seq(-200, 200, by = 10),
             Prior = dnorm(x, 0, 25),
             Posterior = dnorm(x, mean(beta), sd(beta))) %>%
  pivot_longer(Prior:Posterior)

data_at_zero <- filter(pd, x == 0)
pr <- filter(data_at_zero, name == "Prior") %>% pull(value)
po <- filter(data_at_zero, name == "Posterior") %>% pull(value)

ggplot(pd, aes(x = x, y = value, colour = name)) +
  geom_line() +
#  geom_point(data = data_at_zero, aes(x = x, y = value), 
#             colour = "black", size = 2) +
#  geom_label(data = data_at_zero, aes(x = x, y = value, colour = name, label = round(value, 3)), 
#             colour = "black", size = 3, position = position_dodge(.0001)) +
  scale_color_colorblind("") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  labs(y = "density", x = bquote(hat(beta))) #, subtitle = bquote("BF"==.(round(pr/po,2)))) +

Make sure you know your priors

  • Posterior: \(\mathcal{N}(-32.2, 17.4)\)
  • Prior: \(\mathcal{N}(0, 2000)\)
dnorm(0, 0, 2000)
[1] 0.000199
dnorm(0, mean(beta), sd(beta))
[1] 0.00361
dnorm(0, 0, 2000) / dnorm(0, mean(beta), sd(beta))
[1] 0.0553
pd <- tibble(x = seq(-200, 200, by = 10),
             Prior = dnorm(x, 0, 2000),
             Posterior = dnorm(x, mean(beta), sd(beta))) %>%
  pivot_longer(Prior:Posterior)

data_at_zero <- filter(pd, x == 0)
pr <- filter(data_at_zero, name == "Prior") %>% pull(value)
po <- filter(data_at_zero, name == "Posterior") %>% pull(value)

ggplot(pd, aes(x = x, y = value, colour = name)) +
  geom_line() +
#  geom_point(data = data_at_zero, aes(x = x, y = value), 
#             colour = "black", size = 2) +
#  geom_label(data = data_at_zero, aes(x = x, y = value, colour = name, label = round(value, 3)), 
#             colour = "black", size = 3, position = position_dodge(.0001)) +
  scale_color_colorblind("") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  labs(y = "density", x = bquote(hat(beta))) #, subtitle = bquote("BF"==.(round(pr/po,2)))) +

Bayes Factor

  • More informative priors give more weight to \(H_0\).
  • Given such a prior, a posterior that favours \(H_1\) would be more convincing: BFs capture this.
  • Regularising (skeptical) priors prevent the model to get overexcited by the sample (his words: McElreath 2020).
  • If in doubt use weakly informative priors like student-t distributions that allow for larger tails, e.g. student_t(3, 0, 100)

Exercise

  • Explore hypothesis() in the exercise: linear-models/exercises/hypothesis_testing_bf_1.R
  • Bonus: explore the Savage-Dickey method in linear-models/exercises/hypothesis_testing_bf_2.R

Bayes Factor with hypothesis()

H: there is no difference between conjoined and simple NPs

hypothesis(fit_brm, "nptypesimple = 0")
Hypothesis Tests for class b:
          Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob
1 (nptypesimple) = 0    -33.2      17.2    -66.3     0.75         NA        NA
  Star
1     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

H: simple NPs are faster than conjoined NPs

hypothesis(fit_brm, "nptypesimple < 0")
Hypothesis Tests for class b:
          Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob
1 (nptypesimple) < 0    -33.2      17.2    -61.4    -5.88         36      0.97
  Star
1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

H: simple NPs are more than 10 msecs faster than conjoined NPs

hypothesis(fit_brm, "nptypesimple < -10")
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
1 (nptypesimple)-(-10) < 0    -23.2      17.2    -51.4     4.12       11.1
  Post.Prob Star
1      0.92     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

Compare probability models of rt data

see e.g. Matzke and Wagenmakers (2009); Roeser et al. (2024)

Families of distributions

Families of distributions


“Models are devices that connect theories to data. A model is an instantiation of a theory […]” (Rouder, Morey, and Wagenmakers 2016, 2)


  • Our models describe how we understand reality.
  • Model allows us to describe reality qualitatively (with their parameters) and quantitatively (parameter value estimates).
  • Interpretation of parameters depends on data-modeling context.
  • At minimum, distribution families (probability models) depend on data type.

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = gaussian())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect) (Winter and Bürkner 2021)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = bernoulli())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect) (Winter and Bürkner 2021)
  • Poisson: count data (number of words / mistakes)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = poisson())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
  • Poisson: count data (number of words / mistakes) (Winter and Bürkner 2021)
  • Zero-inflated Poisson: count data with lots of zeros (number of typos per word)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = zero_inflated_poisson())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
  • Poisson: count data (number of words / mistakes) (Winter and Bürkner 2021)
  • Zero-inflated Poisson: count data with lots of zeros (number of typos per word)
  • Cumulative: ordinal (ordered categorical) data (Bürkner and Vuorre 2019) (Likert scales)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = cumulative()) # or acat, sratio

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
  • Poisson: count data (number of words / mistakes) (Winter and Bürkner 2021)
  • Zero-inflated Poisson: count data with lots of zeros (number of typos per word)
  • Cumulative: ordinal (ordered categorical) data (Bürkner and Vuorre 2019) (Likert scales)
  • Lognormal: zero bound data with positive skew; also skewed / shifted (log)-Normal, Wiener Diffusion models etc. [for RT data; Matzke and Wagenmakers (2009)]
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = lognormal())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
  • Poisson: count data (number of words / mistakes) (Winter and Bürkner 2021)
  • Zero-inflated Poisson: count data with lots of zeros (number of typos per word)
  • Cumulative: ordinal (ordered categorical) data (Bürkner and Vuorre 2019) (Likert scales)
  • Lognormal: zero bound continuous data with positive skew
  • Ex-Gaussian: continuous data with positive skew
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = exgaussian())

Probability models of rt data

So far we used a Gaussian model which showed a poor fit to the data.

Alternative models: skewed normal, shifted (log)normal, Wiener Diffusion models, ex-Gaussian, mixture of lognormal etc. (see Matzke and Wagenmakers 2009; Roeser et al. 2024).

Gaussian

\[y \sim \mathcal{N}(\mu, \sigma^2)\]

  • Data generating process can be described as a normal distribution with mean \(\mu\) and standard deviation \(\sigma\).

Lognormal

Lognormal

\[ \log(y) \sim \mathcal{N}(\mu, \sigma^2) \]

  • Often used to address positive skew; e.g. response times (Baayen 2008)
  • Log-values are only defined if \(y \in [0, \infty]\)
  • De-emphasize large values over small values.
  • Models the percentage change and not absolute differences:
    • Rather than saying “keystroke intervals are 40 msecs slower”, you say “keystrokes are 7% slower”.
    • Difference of 40 msecs can be huge for fast typing but negligible for pauses around 10 secs.

Shifted lognormal

\[ \log(y - \exp(\text{ndt})) \sim \mathcal{N}(\mu, \sigma^2) \]

  • Variation of the lognormal distribution: a zero shift would be equivalent to lognormal
  • Includes a horizontal shift.
  • Useful for modeling rt data because they are typically not just
    • right-skewed
    • non-negative
  • But also
    • positively shifted: there is a minimum time required for any response

Accounts for a minimum rt: The shift parameter represents the minimum possible rt, accounting for the fact that no response can be instantaneous.

Shifted lognormal (simulated data)

Exercise

Fit two other probability models, namely a lognormal and a shifted lognormal model:

  • Complete script linear-models/exercises/mixed_effects_models_brms_lognormal.R
  • Complete script linear-models/exercises/mixed_effects_models_brms_shifted_lognormal.R

Model comparisons

chapter 11 in Gelman, Hill, and Vehtari (2020); chapter 6 in McElreath (2020)

The Gaussian model showed poor fit to data

Model comparisons

  • How accurately does the model predict new data.
  • \(R^2\) et al.: more complex models are generally better ( \(R^2\) can’t go down).
  • Overfitting: models with more parameters can make unreasonable predictions.
  • Generalisations to data that were used to fit the model are over optimistic (Gelman, Hill, and Vehtari 2020).
  • We will look at both fit and predictive performance of the model.

Compare probability models of rt data

file_dest <- "../plots/fit2data.pdf"
include_graphics(file_dest)

Leave-One-Out (LOO) cross validation

  • Cross validation: performance of a model on new data to remove the overfitting problem.
  • Leave-One-Out (LOO) Information Criterion (LOO-IC):
    • Train model on \(N-1\) observations
    • Predict remaining data point from training model.
    • Repeat process \(N\) times to predict every observation from a model of the remaining data.
  • Adding up prediction results gives an estimate of expected log-predictive density (\(elpd\)); i.e. approximation of results that would be expected for new data.
  • loo() uses the probability calculations to approximate LOO-IC (i.e. Pareto smoothed importance sampling) (Vehtari, Gelman, and Gabry 2015, 2017).

Leave-One-Out (LOO) cross validation

  • Approximation involved in loo() uses the log posterior predictive densities: how likely is each data point given the distribution parameter estimates?

Leave-One-Out (LOO) cross validation

  • Approximation involved in loo() uses the log posterior predictive densities: how likely is each data point given the distribution parameter estimates?

Leave-One-Out (LOO) cross validation

loo(fit_brm)
Computed from 4000 by 1036 log-likelihood matrix.

         Estimate   SE
elpd_loo  -7141.4 40.8
p_loo        41.6  3.6
looic     14282.8 81.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.5]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
# product of n factors one for each data point: with theta being all parameters
# for LOO CV we perform exclude each data point one at a time which is equivalent to multiplying posterior by factor 1/p(y_i \mid \theta)
# LOO posterior excluding i is p(\theta \ mid y_{-i}) = p(\theta \mid y) / p(y_i \mid \theta) using the the 
# LOO distribution uses the posterior simulatios for \theta and giving each simulation a weight of 1 / p(y_i \mid \theta)
# Weighted simulations is used to approximate the predictice distribution of y_i (the held-out data point)
# $$p(\theta \mid y) \propto p(\theta) \prod^n_{i=1} p(y_i \mid \theta)$$
# pointwise density:
# Average likelihood of every observation in training sample
# this is done for each set of parameters sampled from the psoterior distribution
# Average liklihood for each obsersation
# Sum over all observations.
# resulting in the log-pointwise-predictive-density (lppd)

# effective number of parameters is the variance in log-likelihood for every observation: p_waic
# WAIC: -2*(lppd * p_waic)

Leave-One-Out (LOO) cross validation

loo(fit_brm)
Computed from 4000 by 1036 log-likelihood matrix.

         Estimate   SE
elpd_loo  -7141.4 40.8
p_loo        41.6  3.6
looic     14282.8 81.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.5]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
  • elpd_loo: sum of means (expected log predictive density)
  • p_loo: sum of variances
  • looic: \(-2 \cdot (\)elpd_loo\(-\)p_loo\()\) (for deviance scale)

Leave-One-Out (LOO) cross validation

Computed from 4000 by 1036 log-likelihood matrix.

         Estimate   SE
elpd_loo  -7141.4 40.8
p_loo        41.6  3.6
looic     14282.8 81.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.5]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
#difference between two deviances has a chi-squared distribution; factor of 2 scales it that way; also called the Bayesian deviance


# N of parameters
# number of different conjectures for causes of explanations of the data
# How much iformation do we want the model to provide
# how flexible is themodel in fitting the training samples
# penalty term
# expected distance between in-sample and out-of-sample deviance
  • elpd_loo and looic rarely have a direct interpretation (as opposed to loo_R2()): important are differences between models.
  • p_loo is the effective number of parameters: how flexible is the model fit.
loos <- log_lik(fit_brm) %>% as_tibble() %>% mutate(sample = 1:n()) %>%
  pivot_longer(-sample, names_to = "obs", values_to = "loglik") %>%
  group_by(obs) %>%
  summarise(mean_loglik = mean(exp(loglik)),
            var_loglik = var(loglik)) %>% 
  ungroup() %>%
  summarise(elpd_loo = sum(log(mean_loglik)), # sum of mean log lik
            p_loo = sum(var_loglik), # effective number of parameters
            looic = -2 * (elpd_loo - p_loo) # Bayesian deviance
            ) 

#loo_R2(mixturemodel)
#bayes_R2(mixturemodel)

Exercise

  • Complete the script linear-models/exercises/model_comparison.R
  • This script assumes that you have stored the posterior of four models: Gaussian, lognormal, and shifted lognormal

Model comparison

comparison elpd_diff SE elpd_ratio
fit_gaussian vs fit_lognormal -77.3 22.2 3.5
fit_gaussian vs fit_shifted_lognormal -75.6 23.2 3.3
fit_lognormal vs fit_shifted_lognormal 1.7 1.2 1.4
a elpd_diff is often reported as e.g. \(\Delta\widehat{elpd}\).

The end

Summary

  • brms isn’t much more complicated than lme4 and comes with more flexibility.
  • Priors require some thinking.
  • Bayesian models allow straight forward interpretation of parameter values without an arbitrary threshold for significance.

Recommended reading

References

Just in case

Student t-distribution

  • Symmetric continuous distribution with fat tails assigning more probability to extreme values.
  • \(\nu\) parameter controls the degrees of freedom
  • \(\nu = 1\) is a Cauchy distribution
  • When \(\nu \rightarrow \infty\) the t-distribution becomes Gaussian.
library(LaplacesDemon)

x <- seq(-10, 10, by = 0.1)

tibble("t0" = dnorm(x, mean=0, sd=1),
       "t1" = dst(x, mu=0, sigma=1, nu = 1),
       "t2" = dst(x, mu=0, sigma=1, nu = 5),
       "t3" = dst(x, mu=0, sigma=2, nu = 5),
       x = x) %>%
  pivot_longer(-x) %>%
  ggplot(aes(x = x, y = value, colour = name)) +
  geom_line() +
  scale_x_continuous(limits = c(-7, 7)) +
  scale_color_colorblind("Parameter values",
                         breaks = paste0("t",0:3), 
                         labels = c(bquote(mu==0*","~sigma==1),
                                    bquote(mu==0*","~sigma==1*","~nu==1),
                                    bquote(mu==0*","~sigma==1*","~nu==5),
                                    bquote(mu==0*","~sigma==2*","~nu==5))) +
  labs(x = "x", y = "density") +
  theme(legend.justification = "top") 

Amrhein, Valentin, David Trafimow, and Sander Greenland. 2019. “Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis If We Don’t Expect Replication.” The American Statistician 73 (sup1): 262–70.

Annis, Jeffrey, Brent J Miller, and Thomas J Palmeri. 2017. “Bayesian Inference with Stan: A Tutorial on Adding Custom Distributions.” Behavior Research Methods 49: 863–86.

Baayen, R. Harald. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

Baayen, R. Harald, Douglas J. Davidson, and Douglas M. Bates. 2008. “Mixed-Effects Modeling with Crossed Random Effects for Subjects and Items.” Journal of Memory and Language 59 (4): 390–412.

Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3): 255–78.

Bates, Douglas, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2015. “Parsimonious Mixed Models.” arXiv Preprint arXiv:1506.04967.

Bürkner, Paul-Christian. 2017. “Brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80: 1–28. https://doi.org/10.18637/jss.v080.i01.

———. 2018. “Advanced Bayesian Multilevel Modeling with the R Package brms.” The R Journal 10 (1): 395–411. https://doi.org/10.32614/RJ-2018-017.

———. 2019. “Bayesian Item Response Modeling in R with Brms and Stan.” arXiv Preprint arXiv:1905.09501.

Bürkner, Paul-Christian, and Matti Vuorre. 2019. “Ordinal Regression Models in Psychology: A Tutorial.” Advances in Methods and Practices in Psychological Science 2 (1): 77–101.

Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76: 1–32.

Clayton, Aubrey. 2021. Bernoulli’s Fallacy: Statistical Illogic and the Crisis of Modern Science. Columbia University Press.

Cohn, J. 1988. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Earlbam Associates.

Colquhoun, David. 2017. “The Reproducibility of Research and the Misinterpretation of p-Values.” Royal Society Open Science 4 (12): 171085.

Dickey, James M., B. P. Lientz, et al. 1970. “The Weighted Likelihood Ratio, Sharp Hypotheses about Chances, the Order of a Markov Chain.” The Annals of Mathematical Statistics 41 (1): 214–26.

Dienes, Zoltan. 2014. “Using Bayes to Get the Most Out of Non-Significant Results.” Frontiers in Psychology 5 (781): 1–17.

———. 2016. “How Bayes Factors Change Scientific Practice.” Journal of Mathematical Psychology 72: 78–89.

Dienes, Zoltan, and Neil Mclatchie. 2018. “Four Reasons to Prefer Bayesian Analyses over Significance Testing.” Psychonomic Bulletin & Review 25 (1): 207–18.

Eager, Christopher D, and Joseph Roy. 2017. “Mixed Models Are Sometimes Terrible.” Linguistic Society of America, Austin, TX.

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.

Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, p Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–50.

Hoekstra, Rink, Richard D. Morey, Jeffrey N. Rouder, and Eric-Jan Wagenmakers. 2014. “Robust Misinterpretation of Confidence Intervals.” Psychonomic Bulletin & Review 21 (5): 1157–64.

Jeffreys, Harold. 1961. The Theory of Probability. Vol. 3. Oxford: Oxford University Press, Clarendon Press.

Kimball, A, Kailen Shantz, C Eager, and Joseph Roy. 2016. “Beyond Maximal Random Effects for Logistic Regression: Moving Past Convergence Problems.” ArXiv e-Prints.

Kruschke, John K. 2010. “What to Believe: Bayesian Methods for Data Analysis.” Trends in Cognitive Sciences 14 (7): 293–300.

———. 2011. “Bayesian Assessment of Null Values via Parameter Estimation and Model Comparison.” Perspectives on Psychological Science 6 (3): 299–312.

———. 2014. “Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan.”

———. 2018. “Rejecting or Accepting Parameter Values in Bayesian Estimation.” Advances in Methods and Practices in Psychological Science 1 (2): 270–80.

Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. Sage.

Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press.

Loken, Eric, and Andrew Gelman. 2017. “Measurement Error and the Replication Crisis.” Science 355 (6325): 584–85.

Makowski, Dominique, Mattan S Ben-Shachar, and Daniel Lüdecke. 2019. bayestestR: Describing Effects and Their Uncertainty, Existence and Significance Within the Bayesian Framework.” Journal of Open Source Software 4 (40): 1541.

Martin, Randi C, Jason E Crowther, Meredith Knight, Franklin P Tamborello II, and Chin-Lung Yang. 2010. “Planning in Sentence Production: Evidence for the Phrase as a Default Planning Scope.” Cognition 116 (2): 177–92.

Matzke, Dora, and Eric-Jan Wagenmakers. 2009. “Psychological Interpretation of the Ex- Gaussian and Shifted Wald Parameters: A Diffusion Model Analysis.” Psychonomic Bulletin & Review 16 (5): 798–817.

McElreath, Richard. 2020. Statistical Rethinking: A Bayesian course with examples in R and Stan. Vol. 2. CRC Press.

Morey, Richard D., Rink Hoekstra, Jeffrey N Rouder, Michael D. Lee, and Eric-Jan Wagenmakers. 2016. “The Fallacy of Placing Confidence in Confidence Intervals.” Psychonomic Bulletin & Review 23 (1): 103–23.

Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical Methods for Linguistic Research: Foundational Ideas – Part II.” Language and Linguistics Compass 10 (11): 591–613.

Roeser, J., M. Torrance, M. Andrews, and T. Baguley. 2024. “No Default Syntactic Scope for Advance Planning in Sentence Production: Evidence from Finite Mixture Models,” December. https://doi.org/10.31219/osf.io/u7v36.

Rouder, Jeffrey N., Richard D. Morey, and Eric-Jan Wagenmakers. 2016. “The Interplay Between Subjectivity, Statistical Practice, and Psychological Science.” Collabra 2 (1): 1–12.

Sorensen, Tanner, S. Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.” Quantitative Methods for Psychology 12 (3): 175–200.

Vasishth, Shravan, and Bruno Nicenboim. 2016. “Statistical Methods for Linguistic Research: Foundational Ideas – Part I.” arXiv Preprint arXiv:1601.01126.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2015. “Pareto Smoothed Importance Sampling.” arXiv Preprint arXiv:1507.02646.

———. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.

Winter, Bodo, and Paul-Christian Bürkner. 2021. “Poisson Regression for Linguists: A Tutorial Introduction to Modelling Count Data with Brms.” Language and Linguistics Compass 15 (11): e12439.